feat: [anvil dx] add anvil datasets to google datasets catalog (#4807) #4831

Open: frano-m wants to merge 2 commits into main from fran/4807-anvil-dx-google-datasets-jsonld
frano-m commented May 13, 2026

Summary

Adds Schema.org Dataset JSON-LD to AnVIL CMG dataset detail pages so Google Dataset Search can index them. Mirrors the HCA pattern from #4806 — same shared JsonLd component, same shared buildDescription/escapeJsonForHtml/stripHtmlTags/truncateDescription/uniqueNonEmpty helpers, same <JsonLd> mount path.
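
As a rough illustration of the mount step, the sketch below shows one way a Dataset object could be serialized for an inline `<script type="application/ld+json">` tag. Only the name escapeJsonForHtml comes from the description above; its body here is an assumption, and toJsonLdScript is a hypothetical stand-in for what the JsonLd component renders.

```typescript
// Sketch only: the real escapeJsonForHtml in app/utils/schemaOrg/utils.ts may differ.
// Escaping "<", ">" and "&" as \uXXXX keeps the payload valid JSON while
// preventing a "</script>" sequence inside the data from closing the tag early.
function escapeJsonForHtml(json: string): string {
  return json
    .replace(/</g, "\\u003c")
    .replace(/>/g, "\\u003e")
    .replace(/&/g, "\\u0026");
}

// Hypothetical helper: produces the markup a JsonLd component might emit.
function toJsonLdScript(data: Record<string, unknown>): string {
  const payload = escapeJsonForHtml(JSON.stringify(data));
  return `<script type="application/ld+json">${payload}</script>`;
}
```

Because the escapes are valid JSON string escapes, a consumer that parses the script body gets the original values back unchanged.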

New files:

  • app/utils/schemaOrg/anvilDataset.ts: buildAnvilDatasetJsonLd(data, browserURL) for AnVIL DatasetsResponse.
  • __tests__/utils/schemaOrg/anvilDataset.test.ts — 11 unit tests.

Modified:

  • app/utils/schemaOrg/utils.ts — generalised buildDescription (was local to hcaProjectDataset.ts) so HCA + AnVIL + future LungMAP can all reuse it. Takes a caller-owned fallbackSuffix so each consumer controls its own padding phrasing.
  • app/utils/schemaOrg/hcaProjectDataset.ts — uses the shared buildDescription instead of declaring locally. Net -31/+5 lines.
  • pages/[entityListType]/[...params].tsx — unified renderJsonLd<T>(props, entityListType, build) generic helper replaces the per-consumer renderHcaProjectJsonLd / renderAnvilDatasetJsonLd. Mount path is now {isAnVIL && renderJsonLd(props, "datasets", buildAnvilDatasetJsonLd)} and {isHcaDcp && renderJsonLd(props, "projects", buildHcaProjectJsonLd)}. #4808 (LungMAP) drops in as a one-liner on top.
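
The generic helper in the last bullet can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: PageProps, JsonLdObject, DatasetLike and buildDataset are hypothetical, browserURL is assumed to arrive on the props, and the sketch returns the plain object where the real helper would render a JsonLd element.

```typescript
// Hypothetical, simplified stand-ins for the page props and JSON-LD types.
type JsonLdObject = Record<string, unknown>;

interface PageProps {
  entityListType: string; // e.g. "datasets" or "projects"
  browserURL: string;     // assumed to be available on the props
  data?: unknown;         // entity detail response; absent on some routes
}

// One helper serves every consumer: the caller names the entity list type to
// match and supplies the builder for its own response type T.
function renderJsonLd<T>(
  props: PageProps,
  entityListType: string,
  build: (data: T, browserURL: string) => JsonLdObject | undefined
): JsonLdObject | undefined {
  if (props.entityListType !== entityListType) return undefined;
  if (props.data === undefined || props.data === null) return undefined;
  return build(props.data as T, props.browserURL);
}

// Hypothetical consumer: T is inferred from the builder's first parameter.
interface DatasetLike {
  title: string;
}

const buildDataset = (d: DatasetLike, url: string): JsonLdObject => ({
  "@context": "https://schema.org",
  "@type": "Dataset",
  name: d.title,
  url,
});
```

Under these assumptions, adding another consumer is a single call such as renderJsonLd(props, "projects", someBuilder), which is the property the description relies on for #4808.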

Closes #4807. Stacked on #4829 (the HCA PR). Once #4829 merges, rebase this PR's base to main.

Ticket scope audit (MVP)

| Field | Status |
| --- | --- |
| name, description (required) | ✅ implemented |
| identifier, url, sameAs, includedInDataCatalog, isAccessibleForFree, keywords | ✅ implemented |
| creator | ⏸ no consortium field on DatasetEntity; deferred |
| funder, license, distribution, variableMeasured | ⏸ explicitly TBD in issue |
| measurementTechnique | ❌ MVP gap (parity with HCA PR) |

sameAs URLs go to dbGaP (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=<phs>) — AnVIL's only external reference type, no identifiers.org mapping needed. isAccessibleForFree: true for all AnVIL datasets per Google's spec (the flag is the inverse of "paid", not "unrestricted access"; dbGaP gating doesn't make a dataset "paid").
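
To make the field mapping concrete, here is a minimal sketch of a builder that emits the implemented MVP fields. It is not the buildAnvilDatasetJsonLd from this PR: the input shape AnvilDatasetLike, the /datasets/<id> URL pattern, the catalog name, and the body of uniqueNonEmpty are all assumptions made for illustration. Only the dbGaP URL prefix and the isAccessibleForFree reasoning come from the text above.

```typescript
// Hypothetical, trimmed-down input; the real DatasetsResponse has more fields.
interface AnvilDatasetLike {
  id: string;
  title: string;
  description: string;
  dbGapIds: string[]; // assumed to hold phs accessions
  keywords: string[];
}

interface DatasetJsonLd {
  "@context": string;
  "@type": string;
  name: string;
  description: string;
  identifier: string;
  url: string;
  sameAs?: string[];
  includedInDataCatalog: { "@type": string; name: string; url: string };
  isAccessibleForFree: boolean;
  keywords: string[];
}

const DBGAP_STUDY_URL =
  "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=";

// Assumed behavior: trims values, drops blanks and duplicates, keeps first-seen order.
function uniqueNonEmpty(values: string[]): string[] {
  const result: string[] = [];
  for (const value of values) {
    const trimmed = value.trim();
    if (trimmed && result.indexOf(trimmed) === -1) result.push(trimmed);
  }
  return result;
}

function buildDatasetJsonLdSketch(d: AnvilDatasetLike, browserURL: string): DatasetJsonLd {
  const jsonLd: DatasetJsonLd = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name: d.title,
    description: d.description,
    identifier: d.id,
    url: `${browserURL}/datasets/${d.id}`, // assumed route pattern
    includedInDataCatalog: { "@type": "DataCatalog", name: "AnVIL", url: browserURL },
    // "Free" means not paid; dbGaP access control does not change this.
    isAccessibleForFree: true,
    keywords: uniqueNonEmpty(d.keywords),
  };
  const sameAs = uniqueNonEmpty(d.dbGapIds).map(
    (phs) => DBGAP_STUDY_URL + encodeURIComponent(phs)
  );
  // Omit sameAs entirely when there is no dbGaP accession to link to.
  if (sameAs.length > 0) jsonLd.sameAs = sameAs;
  return jsonLd;
}
```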

Test plan

  • npx tsc --noEmit passes
  • npm run lint, npm run check-format pass
  • npx jest __tests__/utils/schemaOrg — 25/25 tests pass (14 HCA + 11 AnVIL)
  • npm run build:anvil-cmg succeeds; 375/422 dataset detail pages emit JSON-LD (47 omissions are sub-tab/export routes where processEntityProps short-circuits — same gating as HCA's project pages)
  • npm run build-ma-dev:hca-dcp — HCA still emits JSON-LD (110/116 project pages)
  • npm run build-dev:lungmap, npm run build-dev:anvil-catalog — clean builds, no JSON-LD (correctly gated)
  • Validate output against Google's Rich Results Test and Schema Markup Validator for representative datasets (open access, controlled access, multi-consortium) after deploy
  • Request indexing via Google Search Console post-merge
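
Page counts like the ones in the build checks above could be spot-checked with a small script over the built HTML. The sketch below is one possible approach and not something this PR ships: extractJsonLd and emitsDataset are hypothetical names, the regex-based extraction is a simplification that assumes a double-quoted type attribute, and reading files from the build output directory is left out.

```typescript
// Extracts and parses every JSON-LD payload found in an HTML string.
// A malformed payload makes JSON.parse throw, which is treated as a failure.
function extractJsonLd(html: string): unknown[] {
  const pattern =
    /<script[^>]*type="application\/ld\+json"[^>]*>([\s\S]*?)<\/script>/g;
  const found: unknown[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    found.push(JSON.parse(match[1]));
  }
  return found;
}

// True when at least one payload on the page declares "@type": "Dataset".
function emitsDataset(html: string): boolean {
  return extractJsonLd(html).some(
    (item) =>
      typeof item === "object" &&
      item !== null &&
      (item as Record<string, unknown>)["@type"] === "Dataset"
  );
}
```

Given an array of page HTML strings, pages.filter(emitsDataset).length gives the number of pages that emit a Dataset.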

🤖 Generated with Claude Code

frano-m changed the base branch from fran/4806-hca-dcp-google-datasets-jsonld to main May 13, 2026 12:13
frano-m force-pushed the fran/4807-anvil-dx-google-datasets-jsonld branch from eca1901 to d2aa49e May 14, 2026 05:47
frano-m requested a review from Copilot May 14, 2026 05:48
Copilot AI left a comment

Pull request overview

Adds Schema.org Dataset JSON-LD to AnVIL CMG dataset detail pages so Google Dataset Search can index them, mirroring the HCA pattern from #4829. The PR also generalises a description helper and consolidates per-consumer JSON-LD render wrappers into a single generic helper to make the upcoming LungMAP integration (#4808) a one-liner.

Changes:

  • New buildAnvilDatasetJsonLd builder (+ 11 unit tests) covering required Dataset fields, dbGaP sameAs URLs, and aggregated keyword union.
  • Promoted buildDescription from hcaProjectDataset.ts into the shared schemaOrg/utils.ts with a caller-owned fallbackSuffix.
  • Replaced per-consumer renderHcaProjectJsonLd with a generic renderJsonLd<T>(props, entityListType, build) and mounted it for both isAnVIL/datasets and isHcaDcp/projects.
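
For context on the second bullet, a shared description helper with a caller-owned fallbackSuffix might look like the sketch below. The signatures and bodies are assumptions; only the helper names appear in this PR. The length bounds reflect Google's Dataset guidelines, which ask for a description of 50 to 5000 characters.

```typescript
// Bounds taken from Google's Dataset structured data guidelines.
const MIN_LENGTH = 50;
const MAX_LENGTH = 5000;

// Naive tag stripper for illustration only; not a sanitizer.
function stripHtmlTags(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Cuts overly long text and marks the cut with "...", staying within max.
function truncateDescription(text: string, max: number = MAX_LENGTH): string {
  if (text.length <= max) return text;
  return text.slice(0, max - 3).replace(/\s+$/, "") + "...";
}

// Assumed signature: falls back to the title when no description exists and
// appends the caller's suffix when the text is shorter than the minimum.
// The caller is responsible for a suffix long enough to reach the minimum.
function buildDescription(
  raw: string | undefined,
  title: string,
  fallbackSuffix: string
): string {
  const cleaned = stripHtmlTags(raw ?? "");
  const base = cleaned.length > 0 ? cleaned : title;
  const padded = base.length >= MIN_LENGTH ? base : `${base} ${fallbackSuffix}`;
  return truncateDescription(padded);
}
```

Keeping the suffix caller-owned means each consumer (HCA, AnVIL, later LungMAP) can word its own padding without the shared helper knowing about any of them.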

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| app/utils/schemaOrg/anvilDataset.ts | New AnVIL Dataset JSON-LD builder, including keyword/sameAs helpers. |
| app/utils/schemaOrg/utils.ts | Adds shared buildDescription previously local to HCA. |
| app/utils/schemaOrg/hcaProjectDataset.ts | Migrates to shared buildDescription; net code reduction. |
| pages/[entityListType]/[...params].tsx | Generic renderJsonLd<T> helper; mounts AnVIL + HCA JSON-LD conditionally. |
| __tests__/utils/schemaOrg/anvilDataset.test.ts | New 11-case unit test suite for the AnVIL builder. |


Comment thread pages/[entityListType]/[...params].tsx
Comment thread pages/[entityListType]/[...params].tsx
Comment thread app/utils/schemaOrg/anvilDataset.ts
Comment thread app/utils/schemaOrg/anvilDataset.ts
frano-m marked this pull request as ready for review May 14, 2026 07:12
Development

Successfully merging this pull request may close these issues.

[AnVIL DX] Add AnVIL datasets to Google Datasets catalog
